Fowlkes–Mallows Score (fowlkes_mallows_score)#
The Fowlkes–Mallows index (FMI) measures how similar two labelings of the same n samples are.
It is most often used for external clustering validation:
labels_true: ground-truth classes (or a reference clustering)
labels_pred: cluster assignments produced by an algorithm
FMI is a pair-counting metric: it looks at all pairs (i, j) and checks whether each pair is placed in the same cluster in both labelings.
Goals#
Build intuition with tiny examples + pairwise heatmaps.
Derive the FMI formula (pair precision/recall) and compute it via a contingency matrix.
Implement FMI from scratch in NumPy and match scikit-learn.
See how FMI reacts to permutations, merges/splits, and label noise.
Use FMI as a model-selection objective for a simple logistic regression classifier.
Quick import#
from sklearn.metrics import fowlkes_mallows_score
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from sklearn.datasets import make_blobs, make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import fowlkes_mallows_score as sk_fowlkes_mallows_score
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)
import sys
import sklearn
import plotly
print("python:", sys.version.split()[0])
print("numpy :", np.__version__)
print("sklearn:", sklearn.__version__)
print("plotly:", plotly.__version__)
python: 3.12.9
numpy : 1.26.2
sklearn: 1.6.0
plotly: 6.5.2
1) Pair-counting view (what FMI actually measures)#
Let \(y^{(T)}\) be the true labels and \(y^{(P)}\) be the predicted labels.
For each pair of samples \((i, j)\) with \(i<j\), define:
same-true: \(\mathbb{1}[y^{(T)}_i = y^{(T)}_j]\)
same-pred: \(\mathbb{1}[y^{(P)}_i = y^{(P)}_j]\)
This turns clustering comparison into a binary decision per pair (together vs apart), yielding pairwise counts:
TP: together in both true and predicted
FP: together in predicted, apart in true
FN: together in true, apart in predicted
TN: apart in both (not used by FMI)
Pairwise precision and recall:
\[
P_{\text{pair}} = \frac{TP}{TP + FP},
\qquad
R_{\text{pair}} = \frac{TP}{TP + FN}
\]
The Fowlkes–Mallows index is their geometric mean:
\[
\mathrm{FMI} = \sqrt{P_{\text{pair}} \cdot R_{\text{pair}}}
= \frac{TP}{\sqrt{(TP + FP)(TP + FN)}}
\]
Notes:
FMI is label-permutation invariant (cluster IDs are arbitrary).
FMI ignores true negatives (pairs separated in both clusterings). With many clusters, TN pairs dominate, so ignoring them keeps the metric focused on who gets grouped together.
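To make the TN-domination point concrete, here is a quick sketch (not part of the original notebook) using scikit-learn's pair_confusion_matrix, which returns ordered pair counts with TN in entry [0, 0]:

```python
import numpy as np
from sklearn.metrics import pair_confusion_matrix


def tn_fraction(labels_a, labels_b):
    """Fraction of all (ordered) pairs separated in both labelings."""
    c = pair_confusion_matrix(labels_a, labels_b)
    return c[0, 0] / c.sum()


rng = np.random.default_rng(0)
n = 300
for k in (2, 10, 50):
    # Two independent random labelings with k clusters each.
    frac = tn_fraction(rng.integers(0, k, n), rng.integers(0, k, n))
    print(f"k={k:2d}  TN fraction ~ {frac:.2f}")
```

With k = 50, nearly every pair is a TN; a metric that rewarded TNs would saturate near 1 regardless of quality, which is why FMI ignores them.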
def pair_confusion_counts_bruteforce(labels_true, labels_pred):
"""Pair confusion counts (TP, FP, FN, TN) in O(n^2).
A pair (i, j) with i<j is a 'positive' if the two samples are in the same cluster.
"""
labels_true = np.asarray(labels_true).ravel()
labels_pred = np.asarray(labels_pred).ravel()
if labels_true.shape != labels_pred.shape:
raise ValueError(f"shape mismatch: true{labels_true.shape} vs pred{labels_pred.shape}")
n = labels_true.size
tp = fp = fn = tn = 0
for i in range(n - 1):
for j in range(i + 1, n):
same_true = labels_true[i] == labels_true[j]
same_pred = labels_pred[i] == labels_pred[j]
if same_true and same_pred:
tp += 1
elif (not same_true) and same_pred:
fp += 1
elif same_true and (not same_pred):
fn += 1
else:
tn += 1
return tp, fp, fn, tn
def fowlkes_mallows_from_pair_counts(tp, fp, fn):
den = (tp + fp) * (tp + fn)
if den == 0:
return 0.0
return float(tp / np.sqrt(den))
# Tiny example: 8 items, 3 true clusters
labels_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
# A predicted clustering that *merges* clusters 1 and 2
labels_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1])
tp, fp, fn, tn = pair_confusion_counts_bruteforce(labels_true, labels_pred)
precision_pair = tp / (tp + fp) if (tp + fp) else 0.0
recall_pair = tp / (tp + fn) if (tp + fn) else 0.0
fmi = fowlkes_mallows_from_pair_counts(tp, fp, fn)
n = labels_true.size
pairs_total = n * (n - 1) // 2
print(f"n={n}, total pairs={pairs_total}")
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"pair-precision={precision_pair:.3f}, pair-recall={recall_pair:.3f}")
print(f"FMI (pair counts)={fmi:.3f}")
print(f"FMI (sklearn) ={sk_fowlkes_mallows_score(labels_true, labels_pred):.3f}")
n=8, total pairs=28
TP=7, FP=6, FN=0, TN=15
pair-precision=0.538, pair-recall=1.000
FMI (pair counts)=0.734
FMI (sklearn) =0.734
def same_cluster_matrix(labels):
labels = np.asarray(labels).ravel()
return labels[:, None] == labels[None, :]
same_true = same_cluster_matrix(labels_true)
same_pred = same_cluster_matrix(labels_pred)
# Encode each (i,j) pair into a category:
# -1 = diagonal, 0 = TN, 1 = FN, 2 = FP, 3 = TP
pair_type = np.zeros_like(same_true, dtype=int)
pair_type[(same_true) & (same_pred)] = 3
pair_type[(~same_true) & (same_pred)] = 2
pair_type[(same_true) & (~same_pred)] = 1
pair_type[(~same_true) & (~same_pred)] = 0
np.fill_diagonal(pair_type, -1)
colorscale_pair = [
[0.00, "#e0e0e0"],
[0.125, "#e0e0e0"], # -1 diag
[0.125, "#ffffff"],
[0.375, "#ffffff"], # 0 TN
[0.375, "#d73027"],
[0.625, "#d73027"], # 1 FN
[0.625, "#4575b4"],
[0.875, "#4575b4"], # 2 FP
[0.875, "#1a9850"],
[1.00, "#1a9850"], # 3 TP
]
fig = make_subplots(
rows=1,
cols=3,
subplot_titles=[
"Same cluster? (true)",
"Same cluster? (pred)",
"Pair types (TP/FP/FN/TN)",
],
)
fig.add_trace(
go.Heatmap(z=same_true.astype(int), colorscale="Greys", showscale=False),
row=1,
col=1,
)
fig.add_trace(
go.Heatmap(z=same_pred.astype(int), colorscale="Greys", showscale=False),
row=1,
col=2,
)
fig.add_trace(
go.Heatmap(
z=pair_type,
zmin=-1,
zmax=3,
colorscale=colorscale_pair,
colorbar=dict(
title="pair",
tickvals=[-1, 0, 1, 2, 3],
ticktext=["diag", "TN", "FN", "FP", "TP"],
),
),
row=1,
col=3,
)
fig.update_layout(
title="FMI compares clusterings by counting sample pairs",
height=380,
)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()
2) Efficient computation via a contingency matrix#
The brute-force pair loop is \(O(n^2)\).
A much faster way uses the contingency matrix \(N\):
rows = true clusters
columns = predicted clusters
\(N_{ij}\) = number of samples that are in true cluster \(i\) and predicted cluster \(j\)
Let \(\binom{m}{2} = \frac{m(m-1)}{2}\) be the number of unordered pairs inside a group of size \(m\).
Then:
\[
TP = \sum_{i,j} \binom{N_{ij}}{2}
\]
Also define row and column sums:
\[
a_i = \sum_j N_{ij},
\qquad
b_j = \sum_i N_{ij}
\]
Pairs placed together by the predicted clustering:
\[
TP + FP = \sum_j \binom{b_j}{2}
\]
Pairs placed together by the true clustering:
\[
TP + FN = \sum_i \binom{a_i}{2}
\]
So:
\[
\mathrm{FMI}
= \frac{\sum_{i,j} \binom{N_{ij}}{2}}
       {\sqrt{\left(\sum_i \binom{a_i}{2}\right)\left(\sum_j \binom{b_j}{2}\right)}}
\]
This avoids enumerating all pairs.
def comb2(m):
m = np.asarray(m, dtype=np.int64)
return m * (m - 1) // 2
def contingency_matrix_numpy(labels_true, labels_pred):
labels_true = np.asarray(labels_true).ravel()
labels_pred = np.asarray(labels_pred).ravel()
if labels_true.shape != labels_pred.shape:
raise ValueError(f"shape mismatch: true{labels_true.shape} vs pred{labels_pred.shape}")
_, t = np.unique(labels_true, return_inverse=True)
_, p = np.unique(labels_pred, return_inverse=True)
n_true = int(t.max()) + 1 if t.size else 0
n_pred = int(p.max()) + 1 if p.size else 0
cm = np.zeros((n_true, n_pred), dtype=np.int64)
np.add.at(cm, (t, p), 1)
return cm
cm = contingency_matrix_numpy(labels_true, labels_pred)
tp_fast = int(comb2(cm).sum())
pred_pairs = int(comb2(cm.sum(axis=0)).sum()) # TP + FP
true_pairs = int(comb2(cm.sum(axis=1)).sum()) # TP + FN
fmi_fast = 0.0 if (pred_pairs == 0 or true_pairs == 0) else float(tp_fast / np.sqrt(pred_pairs * true_pairs))
print("contingency matrix N (rows=true, cols=pred):\n", cm)
print(f"TP={tp_fast}, TP+FP={pred_pairs}, TP+FN={true_pairs}")
print("FMI (from contingency) =", round(fmi_fast, 6))
contingency matrix N (rows=true, cols=pred):
 [[3 0]
 [0 2]
 [0 3]]
TP=7, TP+FP=13, TP+FN=7
FMI (from contingency) = 0.733799
3) NumPy implementation (from scratch)#
Below is a minimal implementation that mirrors the scikit-learn definition.
Implementation notes:
We use the contingency-matrix formula (fast, no \(O(n^2)\) loops).
If either labeling has no within-cluster pairs (all singleton clusters), scikit-learn returns 0.0.
def fowlkes_mallows_score_numpy(labels_true, labels_pred):
"""Fowlkes–Mallows score between two labelings.
Parameters
----------
labels_true, labels_pred : array-like, shape (n_samples,)
Two labelings of the same samples. Values can be any hashable type.
Returns
-------
fmi : float
In [0, 1]. 1 means identical pairwise grouping.
"""
cm = contingency_matrix_numpy(labels_true, labels_pred)
tp = int(comb2(cm).sum())
pred_pairs = int(comb2(cm.sum(axis=0)).sum()) # TP + FP
true_pairs = int(comb2(cm.sum(axis=1)).sum()) # TP + FN
if pred_pairs == 0 or true_pairs == 0:
return 0.0
return float(tp / np.sqrt(pred_pairs * true_pairs))
# Quick correctness checks vs scikit-learn
def check_case(name, y_t, y_p):
a = fowlkes_mallows_score_numpy(y_t, y_p)
b = sk_fowlkes_mallows_score(y_t, y_p)
ok = np.isclose(a, b)
print(f"{name:28s} numpy={a:.6f} sklearn={b:.6f} ok={ok}")
return ok
ok_all = True
ok_all &= check_case("perfect (permute labels)", [0, 0, 1, 1], [1, 1, 0, 0])
ok_all &= check_case("all singletons", np.arange(6), np.arange(6))
ok_all &= check_case("all same cluster", np.zeros(6, dtype=int), np.zeros(6, dtype=int))
ok_all &= check_case("random labels", rng.integers(0, 4, 80), rng.integers(0, 5, 80))
# A few randomized trials
for _ in range(10):
y_t = rng.integers(0, rng.integers(2, 8), size=200)
y_p = rng.integers(0, rng.integers(2, 9), size=200)
ok_all &= np.isclose(fowlkes_mallows_score_numpy(y_t, y_p), sk_fowlkes_mallows_score(y_t, y_p))
print("\nAll checks passed?", ok_all)
4) Behavior: permutations, merges/splits, and label noise#
We’ll create a 2D dataset with obvious clusters and compare several predicted labelings:
perfect: identical to the ground truth
permuted: same clustering, different numeric IDs (should score 1.0)
merged: two clusters merged into one (more false positives)
split: one cluster split into two (more false negatives)
noisy: a fraction of labels randomly corrupted
X, y_true = make_blobs(
n_samples=450,
centers=3,
cluster_std=(0.9, 1.1, 0.8),
random_state=7,
)
# Different predicted labelings
y_perfect = y_true.copy()
y_permuted = (y_true + 1) % 3
y_merged = y_true.copy()
y_merged[y_merged == 2] = 1 # merge cluster 2 into cluster 1
# Split cluster 0 into two by x-coordinate
mask0 = y_true == 0
x0 = X[mask0, 0]
cut = np.median(x0)
y_split = y_true.copy()
y_split[mask0 & (X[:, 0] > cut)] = 3
# Add label noise
noise_frac = 0.12
idx = rng.choice(X.shape[0], size=int(noise_frac * X.shape[0]), replace=False)
y_noisy = y_true.copy()
y_noisy[idx] = rng.integers(0, 4, size=idx.size)
noisy_key = f"noisy ({noise_frac:.0%})"
cases = {
"true": y_true,
"perfect": y_perfect,
"permuted": y_permuted,
"merged": y_merged,
"split": y_split,
noisy_key: y_noisy,
}
scores = {name: fowlkes_mallows_score_numpy(y_true, y) for name, y in cases.items() if name != "true"}
for name, s in scores.items():
print(f"{name:14s} FMI={s:.4f}")
# Visualize true labels and a few predicted labelings
palette = px.colors.qualitative.Set2
def label_colors(labels):
labels = np.asarray(labels)
uniq = np.unique(labels)
color_map = {int(k): palette[i % len(palette)] for i, k in enumerate(uniq)}
return [color_map[int(k)] for k in labels]
fig = make_subplots(
rows=2,
cols=3,
subplot_titles=[
"ground truth",
f"perfect (FMI={scores['perfect']:.3f})",
f"permuted (FMI={scores['permuted']:.3f})",
f"merged (FMI={scores['merged']:.3f})",
f"split (FMI={scores['split']:.3f})",
f"noisy (FMI={scores[noisy_key]:.3f})",
],
)
panels = [
("true", y_true),
("perfect", y_perfect),
("permuted", y_permuted),
("merged", y_merged),
("split", y_split),
("noisy", y_noisy),
]
for k, (_, y) in enumerate(panels):
r = 1 + k // 3
c = 1 + k % 3
fig.add_trace(
go.Scatter(
x=X[:, 0],
y=X[:, 1],
mode="markers",
marker=dict(size=6, color=label_colors(y), opacity=0.85),
showlegend=False,
),
row=r,
col=c,
)
fig.update_layout(height=650, title="Same points, different labelings → different FMI")
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()
# FMI as we increase label noise
def corrupt_labels(y, frac, *, rng):
y = np.asarray(y).copy()
n = y.size
m = int(frac * n)
if m == 0:
return y
idx = rng.choice(n, size=m, replace=False)
labels = np.unique(y)
y[idx] = rng.choice(labels, size=m)
return y
noise_grid = np.linspace(0, 0.8, 41)
fmis = []
for frac in noise_grid:
y_corrupt = corrupt_labels(y_true, frac, rng=rng)
fmis.append(fowlkes_mallows_score_numpy(y_true, y_corrupt))
fig = go.Figure()
fig.add_trace(go.Scatter(x=noise_grid, y=fmis, mode="lines+markers"))
fig.update_layout(
title="FMI decreases as we corrupt more labels",
xaxis_title="fraction of labels randomly reassigned",
yaxis_title="Fowlkes–Mallows index",
)
fig.show()
5) Using FMI to tune a simple model (logistic regression)#
FMI expects hard labels. A classifier like logistic regression produces scores (probabilities), and we choose a threshold \(t\):
\[
\hat{y}_i = \mathbb{1}\left[p_i \ge t\right]
\]
Two important practical points:
FMI is not differentiable w.r.t. model parameters because it depends on discrete predictions.
A common workflow is:
fit the model with a differentiable loss (e.g. log loss)
select hyperparameters / thresholds by maximizing FMI on a validation set (grid search)
Even though FMI is more common for clustering, it is defined for any pair of labelings — so it can also compare class labels.
def add_intercept(X: np.ndarray) -> np.ndarray:
X = np.asarray(X, dtype=float)
return np.c_[np.ones((X.shape[0], 1)), X]
def sigmoid(z):
z = np.asarray(z, dtype=float)
out = np.empty_like(z)
pos = z >= 0
out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
ez = np.exp(z[~pos])
out[~pos] = ez / (1.0 + ez)
return out
def log_loss_from_proba(y_true, p, eps=1e-15):
y_true = np.asarray(y_true, dtype=float)
p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
def fit_logistic_regression_gd(
X,
y,
*,
lr=0.2,
max_iter=3000,
alpha=0.0,
tol=1e-8,
):
"""Binary logistic regression with gradient descent + optional L2 penalty."""
Xb = add_intercept(X)
y = np.asarray(y, dtype=float).ravel()
n, d = Xb.shape
w = np.zeros(d)
history = []
for _ in range(max_iter):
p = sigmoid(Xb @ w)
loss = log_loss_from_proba(y, p) + 0.5 * alpha * np.sum(w[1:] ** 2)
history.append(loss)
grad = (Xb.T @ (p - y)) / n
grad[1:] += alpha * w[1:]
w_new = w - lr * grad
if np.linalg.norm(w_new - w) < tol:
w = w_new
break
w = w_new
return w, np.asarray(history)
def predict_proba_logreg(X, w):
return sigmoid(add_intercept(X) @ w)
# A 2D dataset so we can visualize the decision boundary.
X, y = make_classification(
n_samples=900,
n_features=2,
n_redundant=0,
n_informative=2,
n_clusters_per_class=1,
class_sep=1.1,
flip_y=0.05,
random_state=7,
)
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.35, random_state=7, stratify=y
)
w, history = fit_logistic_regression_gd(X_train, y_train, lr=0.25, max_iter=4000, alpha=1e-2)
fig = go.Figure()
fig.add_trace(go.Scatter(y=history, mode="lines"))
fig.update_layout(title="Logistic regression training loss (log loss)", xaxis_title="iteration", yaxis_title="loss")
fig.show()
# Sweep thresholds and choose the one that maximizes FMI on the validation set
p_val = predict_proba_logreg(X_val, w)
t_grid = np.linspace(0.01, 0.99, 199)
fmi_grid = []
for t in t_grid:
y_hat = (p_val >= t).astype(int)
fmi_grid.append(fowlkes_mallows_score_numpy(y_val, y_hat))
fmi_grid = np.asarray(fmi_grid)
best_idx = int(np.argmax(fmi_grid))
best_t = float(t_grid[best_idx])
best_fmi = float(fmi_grid[best_idx])
print(f"best threshold t*={best_t:.3f} -> FMI={best_fmi:.4f}")
fig = go.Figure()
fig.add_trace(go.Scatter(x=t_grid, y=fmi_grid, mode="lines+markers", name="FMI(t)"))
fig.add_vline(x=0.5, line_dash="dash", line_color="gray")
fig.add_vline(x=best_t, line_dash="dash", line_color="black")
fig.update_layout(
title="Choose a decision threshold by maximizing FMI",
xaxis_title="threshold t",
yaxis_title="Fowlkes–Mallows index",
)
fig.show()
# Visualize how changing the threshold shifts the decision boundary
fig = go.Figure()
# points
fig.add_trace(
go.Scatter(
x=X_val[:, 0],
y=X_val[:, 1],
mode="markers",
marker=dict(size=6, color=y_val, colorscale=[[0, "#4575b4"], [1, "#d73027"]], opacity=0.8),
name="validation points",
)
)
w0, w1, w2 = w
def decision_line_xy(w0, w1, w2, threshold, x1_grid):
# p = sigmoid(w0 + w1*x1 + w2*x2) >= t <=> w0 + w1*x1 + w2*x2 >= log(t/(1-t))
logit_t = np.log(threshold / (1 - threshold))
if abs(w2) < 1e-12:
return None
x2 = (logit_t - w0 - w1 * x1_grid) / w2
return x2
x1_min, x1_max = X_val[:, 0].min() - 0.5, X_val[:, 0].max() + 0.5
x1_grid = np.linspace(x1_min, x1_max, 200)
x2_05 = decision_line_xy(w0, w1, w2, 0.5, x1_grid)
x2_best = decision_line_xy(w0, w1, w2, best_t, x1_grid)
if x2_05 is not None:
fig.add_trace(
go.Scatter(
x=x1_grid,
y=x2_05,
mode="lines",
line=dict(color="gray", dash="dash"),
name="boundary t=0.5",
)
)
if x2_best is not None:
fig.add_trace(
go.Scatter(
x=x1_grid,
y=x2_best,
mode="lines",
line=dict(color="black", dash="dash"),
name=f"boundary t*={best_t:.3f}",
)
)
fig.update_layout(
title="Threshold choice changes which points are predicted positive",
xaxis_title="x1",
yaxis_title="x2",
)
fig.show()
6) Pros, cons, and when to use FMI#
Pros
Permutation invariant: cluster IDs can be renamed without changing the score.
Interpretable as \(\sqrt{\text{pair-precision}\cdot\text{pair-recall}}\).
Works with different numbers of clusters in the two labelings.
Bounded in \([0,1]\) (higher is better).
Cons / pitfalls
Not adjusted for chance: random labelings can score well above zero depending on cluster sizes (compare with adjusted_rand_score).
Ignores TN pairs (pairs separated in both). That is often desirable, but it means FMI is not a full "pair accuracy" metric.
Can be unintuitive when clusters are tiny: if a labeling has all singleton clusters, there are no within-cluster pairs, so FMI is 0.0 (even if the two labelings match).
FMI is non-differentiable w.r.t. model parameters (it depends on discrete assignments), so you typically use it for model selection (grid search), not direct gradient optimization.
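A minimal demonstration (added here; assumes only scikit-learn) of the chance-adjustment pitfall: on independent random labelings, FMI stays well above zero while adjusted_rand_score centers on zero.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

rng = np.random.default_rng(0)
fmi_vals, ari_vals = [], []
for _ in range(200):
    a = rng.integers(0, 3, size=100)  # random "true" labels, 3 clusters
    b = rng.integers(0, 3, size=100)  # independent random "prediction"
    fmi_vals.append(fowlkes_mallows_score(a, b))
    ari_vals.append(adjusted_rand_score(a, b))

print(f"mean FMI = {np.mean(fmi_vals):.3f}")  # roughly 1/3 here, not 0
print(f"mean ARI = {np.mean(ari_vals):.3f}")  # close to 0
```

With 3 balanced clusters, both pair-precision and pair-recall sit near 1/3 for random labels, so FMI lands near 1/3 by chance alone.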
Good use cases
External clustering evaluation when you have ground truth labels.
Comparing two different clusterings of the same dataset (algorithm comparison, stability across runs).
Selecting clustering hyperparameters (e.g. choosing a cut level in hierarchical clustering) when a reference labeling exists.
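As a hypothetical sketch of the hierarchical-clustering use case (not in the original notebook; uses SciPy's linkage/fcluster on toy data), one can sweep the number of flat clusters and keep the cut that maximizes FMI against the reference labeling:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs
from sklearn.metrics import fowlkes_mallows_score

# Toy data with a known reference labeling.
X, y_ref = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

Z = linkage(X, method="ward")  # agglomerative tree
candidates = {
    k: fowlkes_mallows_score(y_ref, fcluster(Z, t=k, criterion="maxclust"))
    for k in range(2, 9)
}
best_k = max(candidates, key=candidates.get)
print(f"best k = {best_k}, FMI = {candidates[best_k]:.3f}")
```

On well-separated blobs like these, the FMI-maximizing cut should recover a clustering close to the reference.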
Not a good fit
Purely unsupervised evaluation (no reference labels): consider internal metrics like silhouette_score or davies_bouldin_score.
When you care about per-class errors or sample-level false positives vs false negatives: use classification metrics like F1 / ROC-AUC.
7) Exercises#
Construct a case where one clustering is a refinement of the other (every predicted cluster is a subset of a true cluster). What happens to pair precision vs pair recall?
Compare FMI vs adjusted_rand_score on random labelings as you vary the number of clusters.
In the logistic regression example, compare the threshold that maximizes FMI vs the threshold that maximizes F1.
References#
Fowlkes, E. B., & Mallows, C. L. (1983). A Method for Comparing Two Hierarchical Clusterings. Journal of the American Statistical Association.
scikit-learn: sklearn.metrics.fowlkes_mallows_score documentation.